Basic Plotting

Previous sections have focused on putting various simple types of data together in notebooks and deployed servers, but most people will want to include plots as well. In this section, we'll focus on one of the simplest (but still powerful) ways to get a plot.

If you have tried to visualize a pandas.DataFrame before, then you have likely encountered the Pandas .plot() API. This basic plotting interface uses Matplotlib to render static PNGs or SVGs in a Jupyter notebook using the inline backend (or interactive figures via %matplotlib notebook or %matplotlib widget), and for exporting from Python, with a command that can be as simple as df.plot() for a DataFrame with one or two columns.
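
As a quick, minimal illustration of that interface (the toy DataFrame and its column names here are purely illustrative):

import pandas as pd

toy = pd.DataFrame({'x': range(10), 'y': [i ** 2 for i in range(10)]})
toy.plot()                      # line plot of every column against the index
toy.plot.scatter(x='x', y='y')  # or choose a specific kind and specific columns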

The Pandas .plot() API has emerged as a de-facto standard for high-level plotting APIs in Python, and is now supported by many different libraries that use other underlying plotting engines to provide additional power and flexibility. Thus learning this API allows you to access capabilities provided by a wide variety of underlying tools, with relatively little additional effort. The libraries currently supporting this API include:

  • Pandas -- Matplotlib-based API included with Pandas. Static or interactive output in Jupyter notebooks.
  • xarray -- Matplotlib-based API included with xarray, based on the Pandas .plot API. Static or interactive output in Jupyter notebooks.
  • hvPlot -- HoloViews and Bokeh-based interactive plots for Pandas, GeoPandas, xarray, Dask, Intake, and Streamz data.
  • Pandas Bokeh -- Bokeh-based interactive plots, for Pandas, GeoPandas, and PySpark data.
  • Cufflinks -- Plotly-based interactive plots for Pandas data.
  • PdVega -- Vega-lite-based, JSON-encoded interactive plots for Pandas data.

In this notebook we'll explore what is possible with the default .plot API and demonstrate the additional capabilities of .hvplot, using the same tabular dataset as in previous sections: earthquakes and other seismological events queried from the USGS Earthquake Catalog via its API. Of course, this particular dataset is just an example; the same approach can be used with just about any tabular dataset.

Read in the data

Here we'll read in the data using Dask, which works well with a relatively large dataset like this (2.1 million rows). We'll use .persist() to bring the whole dataset into main memory (which should be feasible on any recent machine) for higher performance:

In [1]:
import dask.dataframe as dd
In [2]:
df = dd.read_parquet('../data/earthquakes.parq').persist()
df.head()
Out[2]:
depth depthError dmin gap horizontalError id latitude locationSource longitude mag ... magSource magType net nst place rms status time type updated
index
0 7.800 1.400 0.09500 245.14 NaN nn00001936 37.1623 nn -116.6037 0.60 ... nn ml nn 5.0 Nevada 0.0519 reviewed 2000-01-31 23:52:00.619000+00:00 earthquake 2018-04-24T22:22:44.135Z
1 4.516 0.479 0.05131 52.50 NaN ci9137218 34.3610 ci -116.1440 1.72 ... ci mc ci 0.0 26km NNW of Twentynine Palms, California 0.1300 reviewed 2000-01-31 23:44:54.060000+00:00 earthquake 2016-02-17T11:53:52.643Z
2 33.000 NaN NaN NaN NaN usp0009mwt 10.6930 trn -61.1620 2.10 ... trn md us NaN Trinidad, Trinidad and Tobago NaN reviewed 2000-01-31 23:28:38.420000+00:00 earthquake 2014-11-07T01:09:23.016Z
3 33.000 NaN NaN NaN NaN usp0009mws -1.2030 us -80.7160 4.50 ... us mb us NaN near the coast of Ecuador 0.6000 reviewed 2000-01-31 23:05:22.010000+00:00 earthquake 2014-11-07T01:09:23.014Z
4 7.200 0.900 0.11100 202.61 NaN nn00001935 38.7860 nn -119.6409 1.40 ... nn ml nn 5.0 Nevada 0.0715 reviewed 2000-01-31 22:56:50.996000+00:00 earthquake 2018-04-24T22:22:44.054Z

5 rows × 22 columns

Using Pandas .plot

The first thing that we'd like to do with this data is visualize the locations of every earthquake. So we would like to make a scatter or points plot where x='longitude' and y='latitude'.

If you are familiar with the pandas .plot API, you might expect to execute df.plot.scatter(x='longitude', y='latitude'). Feel free to try this out in a new cell, but it will throw an error: AttributeError: 'DataFrame' object has no attribute 'plot'. Since we have a Dask DataFrame rather than a Pandas DataFrame, we need to first convert it to Pandas to use .plot. To make the data more manageable for now, we'll briefly work with just a fraction (1%) of it, calling the result small_df.

In [3]:
%matplotlib inline
In [4]:
small_df = df.sample(frac=.01).compute()
small_df.shape
Out[4]:
(21165, 22)

Now we have a smaller dataset with just 21k earthquakes. We can use that to test out our visualizations before ramping back up to the full dataset.

In [5]:
small_df.plot.scatter(x='longitude', y='latitude')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f678004b978>

Using .hvplot

As you can see above, the Pandas API gives you a usable plot very easily, where you can start to see the structure of the edges of the plates (which in some cases correspond with the edges of the continents and in others are between two continents). You can make a very similar plot with the same arguments using hvplot.

In [6]:
import hvplot.pandas
In [7]:
small_df.hvplot.scatter(x='longitude', y='latitude')
Out[7]:

Here, unlike with Pandas .plot(), there is a default hover action on the data points that shows the location values, and you can also pan and zoom to focus on any particular region of the data of interest.

You might have noticed that many of the dots in the scatter that we've just created lie on top of one another. This is called "overplotting" and can be avoided in a variety of ways, such as by making the dots slightly transparent, or binning the data. These approaches have the downside of introducing bias because you need to choose the alpha or the edges of the bins, and in order to do that, you have to make assumptions about the data. For an initial exploration of a new dataset, it's much safer if you can just see the data, before you impose any assumptions about its form or structure.

Exercise

Try changing the alpha (e.g. 0.1) on the plot above to see the effect of this approach.

Try creating a hexbin plot.
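
One possible solution to each exercise (a sketch; gridsize=50 for the hexbin is an arbitrary choice, so tune it to taste):

small_df.hvplot.scatter(x='longitude', y='latitude', alpha=0.1)  # transparency reveals overlap
small_df.plot.hexbin(x='longitude', y='latitude', gridsize=50)   # Matplotlib hexagonal binning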

Datashader

To avoid some of the problems of traditional scatter/point plots, we can use Datashader, which aggregates data into each pixel without any arbitrary parameter settings. In hvplot we can activate this capability by setting datashade=True.

In [8]:
small_df.hvplot.scatter(x='longitude', y='latitude', datashade=True)
Out[8]:

We can already see a lot more detail, but remember that we are still only plotting 1% of the data (21k earthquakes). With Datashader, we can just as easily plot the full, original dataset:

In [9]:
import hvplot.dask  # noqa: adds hvplot method to dask objects
In [10]:
df.hvplot.scatter(x='longitude', y='latitude', datashade=True)
Out[10]:

Here you can see all of the rich detail in this set of millions of earthquake event locations. If you have a live Python process running, you can zoom in and see additional detail at each zoom level, without tuning any parameters or making any assumptions about the form or structure of the data. We'll come back to Datashader later, but for now the important thing to know about it is that it lets us work with arbitrarily large datasets in a web browser conveniently.

Note that the .hvplot() API works here even though df is a Dask rather than a Pandas object, because unlike the other .plot libraries, hvplot doesn't just target Pandas. Instead, hvplot can be used with all of the following (a brief xarray sketch follows the list):

  • Pandas : DataFrame, Series (columnar/tabular data)
  • xarray : Dataset, DataArray (labelled multidimensional arrays)
  • Dask : DataFrame, Series (distributed/out of core arrays and columnar data)
  • Streamz : DataFrame(s), Series(s) (streaming columnar data)
  • Intake : DataSource (data catalogues)
  • GeoPandas : GeoDataFrame (geometry data)
  • NetworkX : Graph (network graphs)
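
The pattern is the same in each case: importing the matching hvplot module registers a .hvplot accessor on that library's objects. Here is a brief sketch for xarray (xr.tutorial.open_dataset fetches a small sample dataset, so this assumes network access on first run):

import xarray as xr
import hvplot.xarray  # noqa: adds the .hvplot accessor to xarray objects

air = xr.tutorial.open_dataset('air_temperature').air
air.isel(time=0).hvplot()  # a 2D lat/lon slice renders as an image by default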

Exercise

If you are brave and don't mind refreshing your browser tab if it dies, create a scatter for the full dataset without setting datashade=True.
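
For reference, the exercise is a single line; it asks the browser to render ~2.1 million individual glyphs, which is exactly what datashade=True avoids:

df.hvplot.scatter(x='longitude', y='latitude')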

A Note on points

As a final note, we should really use hvplot.points instead of hvplot.scatter in this instance. The former does not exist in the standard Pandas .plot API, which is why we have used hvplot.scatter up to this point.

The reason scatter is inappropriate is that it implies that the y-axis (latitude) is a dependent variable with respect to the x-axis (longitude). In reality, this is not the case: earthquakes can happen anywhere on the Earth's two-dimensional surface. For this reason, it is best to use hvplot.points for earthquake locations, as will be explained further in the next notebook.

In [11]:
df.hvplot.points(x='longitude', y='latitude', datashade=True)
Out[11]:

Statistical Plots

Let's dive into some of the other capabilities of .plot() and .hvplot(), starting with the frequency of different magnitude earthquakes.

Magnitude        Earthquake Effect                                                        Estimated Number Each Year
2.5 or less      Usually not felt, but can be recorded by seismograph.                    900,000
2.5 to 5.4       Often felt, but only causes minor damage.                                30,000
5.5 to 6.0       Slight damage to buildings and other structures.                         500
6.1 to 6.9       May cause a lot of damage in very populated areas.                       100
7.0 to 7.9       Major earthquake. Serious damage.                                        20
8.0 or greater   Great earthquake. Can totally destroy communities near the epicenter.    One every 5 to 10 years

As a first pass, we'll make a histogram, first with .plot.hist on the small data, then with .hvplot.hist on the full dataset. Before plotting, we can clean the data by setting any non-positive magnitudes to NaN.

In [12]:
cleaned_df = df.copy()
cleaned_df['mag'] = df.mag.where(df.mag > 0)
cleaned_small_df = cleaned_df.sample(frac=.01).compute()
In [13]:
cleaned_small_df.plot.hist(y='mag')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f676af254e0>

Similarly, we can create a histogram of the whole dataset using hvplot.

In [14]:
cleaned_df.hvplot.hist(y='mag', bin_range=(0,10), bins=50)
Out[14]:

Exercise

Create a kernel density estimate (kde) plot of magnitude.
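
One possible solution, using the small in-memory sample since a KDE over millions of points can be slow:

cleaned_small_df.hvplot.kde(y='mag')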

Categorical variables

Next we'll categorize the earthquakes based on depth. You can read about all the different variables available in this dataset in the USGS earthquake catalog documentation. In the interest of time, we'll use the small dataset and assume that it is representative of all the earthquakes. According to the USGS page on earthquake depths:

Shallow earthquakes are between 0 and 70 km deep; intermediate earthquakes, 70 - 300 km deep; and deep earthquakes, 300 - 700 km deep. In general, the term "deep-focus earthquakes" is applied to earthquakes deeper than 70 km. All earthquakes deeper than 70 km are localized within great slabs of lithosphere that are sinking into the Earth's mantle.

First we'll use pd.cut to split the small dataset (cleaned_small_df) into depth categories.

In [15]:
import numpy as np
import pandas as pd
In [16]:
depth_bins = [-np.inf, 70, 300, np.inf]
depth_names = ['Shallow', 'Intermediate', 'Deep']
depth_class_column = pd.cut(cleaned_small_df['depth'], depth_bins, labels=depth_names)

cleaned_small_df.insert(1, 'depth_class', depth_class_column)

We can now use this new categorical variable to group our data. First we will overlay all our groups on the same plot using the by option:

In [17]:
cleaned_small_df.hvplot.hist(y='mag', by='depth_class')
Out[17]:

NOTE: Click on the legend to turn off certain categories and see what is behind them.

Exercise

Add subplots=True and width=300 to see the different classes side-by-side. The y-axis will be linked, so try zooming.
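
One way to complete the exercise:

cleaned_small_df.hvplot.hist(y='mag', by='depth_class', subplots=True, width=300)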

Grouping

To use a widget to toggle between classes, use the groupby option, here in a bivariate plot:

In [18]:
cleaned_small_df.hvplot.bivariate(x='mag', y='depth', groupby='depth_class')
Out[18]:

In addition to classifying by depth, we can classify by magnitude.

Class      Magnitude
Great      8 or more
Major      7 - 7.9
Strong     6 - 6.9
Moderate   5 - 5.9
Light      4 - 4.9
Minor      3 - 3.9
In [19]:
classified_df = df[df.mag >= 3].compute()

depth_class_column = pd.cut(classified_df['depth'], depth_bins, labels=depth_names)
classified_df.insert(1, 'depth_class', depth_class_column)

mag_bins = [2.9, 3.9, 4.9, 5.9, 6.9, 7.9, 10]
mag_names = ['Minor', 'Light', 'Moderate', 'Strong', 'Major', 'Great']
mag_class_column = pd.cut(classified_df['mag'], mag_bins, labels=mag_names)
classified_df.insert(11, 'mag_class', mag_class_column)
In [20]:
classified_df.hvplot.heatmap(x='mag_class', y='depth_class', C='id', reduce_function=np.count_nonzero)
Out[20]:

Here it is clear that the most commonly detected events are light, and typically shallow.
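
If you want a non-graphical cross-check of that reading, a quick groupby over the same two categorical columns reproduces the counts behind the heatmap (a sketch; the exact numbers depend on your catalog snapshot):

counts = classified_df.groupby(['mag_class', 'depth_class']).size()
counts.sort_values(ascending=False).head()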

Exploring further

These visualizations just touch the surface of what is available from hvPlot. To see many more examples, study the hvPlot website. The following section will focus on how to put these plots together once you have them, linking them to understand and show their relationships.

